{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "afd55886-5f5b-4794-838e-ef8179fb0394",
   "metadata": {},
   "source": [
    "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
    "```\n",
    "make venv \n",
    "source venv/bin/activate \n",
    "pip install jupyterlab\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "## This is here as a reference only\n",
    "# Users and application developers must use the right tag for the latest from pypi\n",
    "%pip install 'data-prep-toolkit[ray]'\n",
    "%pip install 'data-prep-toolkit-transforms[lang_id]'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "##### **** Configure the transform parameters. The set of dictionary keys holding DocIDTransform configuration for values are as follows: \n",
    "| Key name  | Default  | Description |\n",
    "|------------|----------|--------------|\n",
    "| _lang_id_model_credential_ | _unset_ | specifies the credential you use to get model. This will be huggingface token. [Guide to get huggingface token](https://huggingface.co/docs/hub/security-tokens) |\n",
    "| _lang_id_model_kind_ | _unset_ | specifies what kind of model you want to use for language identification. Currently, only `fasttext` is available. |\n",
    "| _lang_id_model_url_ | _unset_ |  specifies url that model locates. For fasttext, this will be repo nme of the model, like `facebook/fasttext-language-identification` |\n",
    "| _lang_id_content_column_name_ | `contents` | specifies name of the column containing documents |\n",
    "| _lang_id_output_lang_column_name_ | `lang` | specifies name of the output column to hold predicted language code |\n",
    "| _lang_id_output_score_column_name_ | `score` | specifies name of the output column to hold score of prediction |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebf1f782-0e61-485c-8670-81066beb734c",
   "metadata": {},
   "source": [
    "##### ***** Import required classes and modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9669273a-8fcc-4b40-9b20-8df658e2ab58",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dpk_lang_id.ray.transform import LangId"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7234563c-2924-4150-8a31-4aec98c1bf33",
   "metadata": {},
   "source": [
    "##### ***** Setup runtime parameters for this transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "badafb96-64d2-4bb8-9f3e-b23713fd5c3f",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "06:24:04 INFO - lang_id parameters are : {'model_credential': 'PUT YOUR OWN HUGGINGFACE CREDENTIAL', 'model_kind': 'fasttext', 'model_url': 'facebook/fasttext-language-identification', 'content_column_name': 'text', 'output_lang_column_name': 'lang', 'output_score_column_name': 'score'}\n",
      "06:24:04 INFO - pipeline id pipeline_id\n",
      "06:24:04 INFO - code location None\n",
      "06:24:04 INFO - number of workers 1 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n",
      "06:24:04 INFO - actor creation delay 0\n",
      "06:24:04 INFO - job details {'job category': 'preprocessing', 'job name': 'lang_id', 'job type': 'ray', 'job id': 'job_id'}\n",
      "06:24:04 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output\n",
      "06:24:04 INFO - data factory data_ max_files -1, n_sample -1\n",
      "06:24:04 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
      "06:24:04 INFO - Running locally\n",
      "2025-01-14 06:24:05,129\tINFO worker.py:1777 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265 \u001b[39m\u001b[22m\n",
      "\u001b[36m(orchestrate pid=69369)\u001b[0m 06:24:06 INFO - orchestrator started at 2025-01-14 06:24:06\n",
      "\u001b[36m(orchestrate pid=69369)\u001b[0m 06:24:06 INFO - Number of files is 3, source profile {'max_file_size': 0.3023223876953125, 'min_file_size': 0.037346839904785156, 'total_file_size': 0.4433746337890625}\n",
      "\u001b[36m(orchestrate pid=69369)\u001b[0m 06:24:06 INFO - Cluster resources: {'cpus': 12, 'gpus': 0, 'memory': 16.166772461496294, 'object_store': 2.0}\n",
      "\u001b[36m(orchestrate pid=69369)\u001b[0m 06:24:06 INFO - Number of workers - 1 with {'num_cpus': 0.8, 'max_restarts': -1} each\n",
      "\u001b[36m(RayTransformFileProcessor pid=69404)\u001b[0m Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.\n",
      "\u001b[36m(orchestrate pid=69369)\u001b[0m 06:24:08 INFO - Completed 1 files in 0.005 min\n",
      "\u001b[36m(orchestrate pid=69369)\u001b[0m 06:24:08 INFO - Completed 2 files in 0.006 min\n",
      "\u001b[36m(orchestrate pid=69369)\u001b[0m 06:24:08 INFO - Completed 2 files (66.667%)  in 0.006 min. Waiting for completion\n",
      "\u001b[36m(orchestrate pid=69369)\u001b[0m 06:24:09 INFO - Completed processing 3 files in 0.009 min\n",
      "\u001b[36m(orchestrate pid=69369)\u001b[0m 06:24:09 INFO - done flushing in 0.001 sec\n",
      "06:24:19 INFO - Completed execution in 0.248 min, execution result 0\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "LangId(input_folder= \"test-data/input\",\n",
    "        output_folder= \"output\",\n",
    "        lang_id_model_credential= \"PUT YOUR OWN HUGGINGFACE CREDENTIAL\",\n",
    "        lang_id_model_kind= \"fasttext\",\n",
    "        lang_id_model_url= \"facebook/fasttext-language-identification\",\n",
    "        run_locally= True,\n",
    "        lang_id_content_column_name= \"text\").transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
   "metadata": {},
   "source": [
    "##### **** The specified folder will include the transformed parquet files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "7276fe84-6512-4605-ab65-747351e13a7c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['output/test_03.parquet',\n",
       " 'output/test_02.parquet',\n",
       " 'output/metadata.json',\n",
       " 'output/test_01.parquet']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import glob\n",
    "glob.glob(\"output/*\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
